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C^ ■ Abstract 



This paper proposes a type of pseudorandom number generator, Mersenne 
Twister for Graphic Processor (MTGP), for efficient generation on graptiic 
processessing units (GPUs). MTGP supports large state sizes sucti as 
ff2f3 bits, and uses the high parallehsm of GPUs in computing many 
^O ■ steps of the recursion in parallel. The second proposal is a parameter-set 

generator for MTGP, named MTGP Dynamic Creator (MTGPDG). MT- 
GPDC creates up to 2^^ distinct parameter sets which generate sequences 
C/3 , with high-dimensional uniformity. This facility is suitable for a large grid 

O ■ of GPUs where each GPU requires separate random number streams. 

MTGP is based on linear recursion over the two-element field, and 
f^ . has better high-dimensional equidistribution than the Mersenne Twister 

^ ■ pseudorandom number generator. 

[~^ , keywords: Pseudo Random Number Generator, Mersenne Twister, General- 

0^ ' Purpose Computing on Graphics Processing Units, Dynamic Creator 
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O; 1 Random number generation for GPU 



A Graphic Processing Unit (GPU) is a highly parallel processor designed for 
computer graphics. GPUs are now widely used in both personal computers 
and game machines, and are cheap despite their high computational power. As 
/\ \ a consequence, a trend in parallel computation is General-Purpose computing 

S ■ on Graphics Processing Units (GPGPV) |15j . namely, to utilize the high par- 

allelism of GPUs for solving computing problems outside the graphics domain. 
A number of recent super-computers actually consist of a large grid of GPUs, 
controlled by one or several CPUs. 

Pseudorandom number generators are often necessary in GPGPU, for exam- 
ple, in Monte Carlo simulations, so it is useful to design pseudorandom number 
generators taking advantage of the parallelism of GPUs. 

We propose a class of pseudorandom number generators, Mersenne Twister 
for Graphic Processors (MTGP). The algorithm is similar to that of Mersenne 
Twister (MT) [T7], but refined and adjusted to the hierarchical parallelism of 
GPUs. The parameter sets for generators are selected by their high-dimensional 
equidistribution properties. We prepared 128 different parameter sets for each 



of the periods 2"2i3 _ i^ 2^3^09 _ i and 2^4497 _ i. The gap between the theo- 
retical upper bound and the reaUzed dimensions of equidistribution of MTGPs 
is smaher than those of MT, so the MTGPs provide better statistical quality 
for the same amount of storage. 

We also designed a parameter-set generator for MTGP, named MTGP Dy- 
namic Creator (MTGPDC). Analogously to Dynamic Creator [H] for MT, MT- 
GPDC finds good parameter-sets for MTGP with high dimensions of equidis- 
tribution. Users specify a set of generator IDs and the desired Mersenne prime 
period, then the ID is embedded in the recursion parameters of each generator, 
so that generators with distinct IDs will yield independent streams. This is 
useful for large grids of CPUs, where each GPU needs one or more indepen- 
dent random number streams, as each stream can be generated from generator 
recursion parameters with a distinct ID. 

The design pohcies of MTGP and MTGPDC are as follows. 

1. Many parallel threads operate on the state space of one large pseudoran- 
dom number generator (PRNG). The large state allows a long period and 
high-dimensional equidistribution properties. This is in contrast to the 
usual approach to PRNG parallelism, namely, one generator per thread. 
In the latter case, the increase in the number of threads implies that each 
generator has a small state space, since the size of fast memory in a GPU 
is limited. 

2. CPUs have a hierarchy in memory: some memory is fast but its access is 
limited to a group of threads (called a block), and some memory is fast 
but read-only. MTGP takes into account the characteristics of each class 
of memory for efficient parallel generation. 

2 GPGPU and CUDA 

For GPGPU, a typical hardware setting is as follows. One CPU, called the host 
CPU, is connected to one GPU (often to a grid of CPUs, but for simplicity of 
explanation we choose the one-GPU case). Consider a computation C which 
one wants to do in GPGPU. As usual, C is divided into several parts, and 
some of the parts can be executed in parallel. To use the GPU effectively, 
one needs to analyze which part of the program can run in parallel, in the 
many processing units in the GPU. Then, one writes a program for the CPU, 
called the host program, which sends input data and GPU codes (called the 
kernel program) to the GPU for parallel execution. The GPU does the given 
computation, and returns the output data to the CPU. Usually the output data 
of GPU computation is a large array, so the data is written in a memory called 
global memory which is accessible from the CPU. 

Under this setting, the computer program must be equipped with facili- 
ties for scheduling execution of GPU kernels and communication between the 
CPU and the GPU. One solution is the Compute Unified Device Architecture 
(CUDA) [2^, [22], which is a widely used software-development environment for 
GPGPU on NVIDIA's CPUs. CUDA has a C-like language which supports 
these facilities. A CUDA program consists of one host program and several 
kernel programs. Some more concrete explanation is in § [S] 



MTGP is written in and executed in the CUDA-environment. Initialization 
of MTGP is done in the host program (i.e. by the CPU), and the host program 
calls a kernel program (i.e. invokes the GPU) which generates a specified number 
of pseudorandom numbers in the global memory (by the GPU) . MTGPDC is 
written in C and executed in the usual CPU (i.e. without GPU). 

Another programming environment for GPGPU is the OpenCL [5]. At 
present, there is no OpenCL version of MTGP/MTGPDC. We use CUDA's 
terminology, but put comments on OpenCL 's corresponding terminology when 
they differ. 

Hierarchical Structure: software and memory 

Since the hardware architecture of a GPU is rather complicated, we explain 
mainly its software-level hierarchy. When we mention concrete values as exam- 
ples, we use a middle-range GPU GTX260. 

In a kernel program, (namely, a program executed in the GPU), the mini- 
mum execution unit is called a thread {work item in OpenCL), which is one lane 
in a CUDA execution grid. A block {work group in OpenCL) consists of many 
threads, with some upper bound on their number (e.g. 512 under the present 
CUDA). We may think of these threads in one block as running in parallel. A 
GPU can run several blocks in parallel. 

To realize the high parallelism, threads and blocks have strong constraints in 
accessing memory and getting instructions. There is a corresponding hierarchy 
in memory. Each thread has its own set of registers, which is inaccessible from 
other threads. A block has its own shared memory {local memory in OpenCL) 
of size 16Kbyte (for most CUDA-enabled CPUs at present), inside the GPU 
chip. Any thread in one block can access the shared memory of the block, but 
can not access those of other blocks. A block has also its own read-only cache 
of the texture memory {image object in OpenCL), intended for storing texture 
information in computer graphics. 

The tightest restriction is that any thread in a block gets the same instruction 
sequence. Each thread has its own thread-ID number (consecutive) and can refer 
to that. Consequently, each thread acts differently according to its ID-number. 

Global memory is stored in external memory chips, located outside the GPU 
chip. Although our ideas on MTGP apply more generally, for the readers who 
may want to grasp the size and speed of GPU, we give illustrative examples: 
the size of the memory, in the case of GTX260, is typically 896Mbyte. Data- 
transfer speed between global memory and the GPU is 112Gbyte/sec, which is 
faster than typical CPU-memory transfer speeds (eg. 26Gbyte/sec.) Still, the 
global memory is slower to access than the shared memory inside the GPU. 
Unlike the shared memory, all threads in the GPU, regardless of which block 
they belong to, can access the global memory. Different blocks can exchange 
information only via the global memory. The shared memory is grouped into 
16 banks, according to the address (as memory of 32-bit words) modulo 16. 
The threads in one block are grouped in warps. A warp consists of 32 threads 
in the present CUDA-enabled CPUs. A half warp (16 threads in a warp) may 
simultaneously access the shared memory, only if each thread accesses to a 
mutually distinct bank. If two or more threads in a half warp access the same 
bank (namely, the memory addresses coincide modulo 16), then they can not 
access in parallel. This phenomenon is called a bank conflict. 



The cache of the texture memory can be read by many threads simultane- 
ously, differently from the shared memory where bank-conflicts may occur. 

3 Mersenne Twister for GP (MTGP) 

3.1 Pseudorandom number generators by recursion 

Let W be the set of w-bit {w: word size) integers, and Xi ^ W {i = 0,1,2, . . .) 
be a sequence of w-hit integers. 

For generating pseudorandom numbers, it is common to use a recursion: 
Let A^ be a positive integer (called the degree), f : W^ —> W he a. function, 
and a;o,a;i, . . . ,a;Ar_i be elements of W (called the initial values). Then, the 
following recursion generates a sequence of elements in W: 

XN+i'-^ f{xN-l+i,XN-2+i,---,Xi) (i = 0, 1,...). (1) 

For high-speed generation, it is better to choose an / depending only on few 
variables, but if the variables are too few, the generated sequence tends to show 
non-randomness. As a trade-off, we consider the following type of recursion: 

XN+i-^f{xM+i,Xi+i,Xi) (i = 0,l,...). (2) 

We call the positive integer M {\ < M < N) the middle position. 

Such a recursion can be efficiently computable (under an appropriate choice 
of /) using an array X[Q..L — 1] of words of length L with L > N (see [9j 
Algorithm A, p.28]), as follows. 

1. Store the initial values xq, . . . , xn-i to A"[0], A"[l], . . . ,X[N — 1]. 

2. Set an integer variable i to 0. 

3. Set 

X[{N + i) mod L] ^ f{X[{M + i) mod L],X[{1 + i) mod L\,X[i mod L\). 
This computes x^+i. 

4. Increment i<— i + 1. li i > L, then i -(^ i mod L. 

5. Return to step[3l 

3.2 Parallel computation of a single recursion 

Suppose that L is sufficiently large, namely L > 2N — M . Then, one may 
compute 

X[N + ^] ^ f{X[M + i],X[l + i\,X[i]) 

for < i < A^ — Af — 1 in parallel. When i — N — M , the computation of 
X[N ~\- i] requires the value of X[M + i] = X[N], which can be obtained only 
after the i = 0-th step has been done, giving the above upper bound N — M for 
the number of parallel threads. 

A basic strategy in our proposed MTGP is to use this parallelism for threads 
in one block, as pictured in Figure [TJ One block works on a single recursion 
using up to iV — M threads in parallel. Suppose that one block has n threads 
[n < N — M). Then, the i-th thread {1 <i < n) works as follows: 
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Figure 1: Strategy of MTGP: one block works on a single recursion using up to 
N — M threads. At Step 1, the threads read from memory via arrows numbered 
1 in parallel, then at Step 2 (respectively 3) via arrows numbered 2 (respectively 
3). At Step 4, the threads write the generated numbers via arrows numbered 4. 



Step 1 reads X[i - 1] (1 < i < n). 

Step 2 reads X[i\ {1 <i<n). 

Step 3 reads X[M + i - 1] {I <i<n). 

Step 4 computes /(X[M+2-l],X[z],X[i-l]) and writes the result to X[iV+i] 
(1 < i < n). 

This idea of one block for one generator is in contrast to a simpler idea: one 
thread for one generator, adopted in CUDA SDK Mersenne Twister sample, 
which will be explained in ij4.1l 

3.3 F2-linear generators and Mersenne Twister 

We briefly recall the notion of F2-linear generators, in particular Mersenne 
Twister (MT) [27j generator, since MTGP is its variant. See |13j for a gen- 
eral theory on F2-linear generators. 

In this article, we identify the set of bits {0, 1} with the two-element field 
F2. A w-bit integer is identified with a horizontal vector in F^", and © denotes 
the sum as vectors (i.e., bit-wise exor). Thus, the set of word-size integers is an 
F2-lincar space, as well as the set of states of a memory array, etc. We mean by 
an F2-linear generator a pseudorandom number generator with F2-linear vector 
state space, F2-linear transition function, and F2-linear output function. 



Notation 

The following notations on bit-operations on words are used. The bitwise-and is 
denoted by &. Let r be an integer with 1 < r < w and x, y be two w-hit words. 
The {w — r)-bit integer consisting of the {w — r) most significant bits (MSBs) 
of X is denoted by x"'"'". The r-bit integer consisting of the r least significant 
bits (LSBs) of y is denoted by y*". The w-bit integer obtained by concatenating 



x^~^ and y*" in this order is denoted by (x'"~''|y''). The bold means the zero 
word. Thus, (x"'~''|0'') is a word whose most significant w ~ r bits coincide 
with those of x and the other r bits being 0. The logical left shift of x by r 
bits is denoted by (x <; r), and the right shift by (x :^ r). A hexadecimal 
number is denoted with Ox at the head, so for example Oxf denotes the integer 
15 whose binary representation is 1111. Matrices are denoted by bold letters 
such as A and R, and multiplication to a row vector x is denoted by xA, and 
every component is computed modulo 2, i.e., in F2. 

Choose a Mersenne prime, i.e., a prime number of the form of 2^ — 1; the 
integer p is called a Mersenne exponent (MEXP). A basic strategy of MT is 
to realize the Mersenne prime period. For this purpose, put TV = [p/w] , r = 
wN—p. Thus, N is the least length of array of w-bit integers that accommodates 
p bits. MT generates a sequence of elements in F|' by a recursion 

xat+j = /(xA/+i,xi+i,5c7"'"'~) (3) 

= XM+. © (xr^Hx^^+JA, (4) 

where A is a (w x u')-matrix such that xA is computable by 

r (x»i) (ifxi = o) 

'''^"\ (x>l)®a (ifx^ = l), ^^> 

where a is a constant w-dimensional row vector. 
The function 

(,XjV+j:-l,Xjv+i_2, • ■ • ,Xi+l,X,- j »-> ^XAr+i,XAr+i_l, . . . ,X.i+2,Xj_|_^ ) 

is a fixed F2-linear transformation F, and the recursion is simply the iteration of 
F. Thus, Mersenne Twister is considered as an automaton with state transition 
function i^ : Fj ^- Fj, where p = wN — r. The period of F attains the maximum 
2^ — 1 if and only if the characteristic polynomial of F is primitive — there is 
an efficient algorithm to check this when p is a Mersenne exponent [17) . 

3.4 A;-distribution and tempering 

The quality of a pseudorandom number generator is often measured by the 
period, together with its high-dimensional equidistribution property, defined 
below (cf. E][n]). 

Definition 3.1 A sequence of v -bit integers with period P = 2^ — 1 

Xo,xi,...,xp_i,xp = xo,... 

is said to be fc-dimensionally equidistributed if the consecutive k-tuples 

(xi,Xi+i,...,Xj+fe_i), i = 0, ...,P- 1 

are uniformly distributed over all possible kv-bit patterns except the all-0 pattern, 
which occurs once less often, i.e., each distinct k-tuple ofv-bit words appears the 
same number of times in the sequence (with one exception of k-tuple of zeroes). 



Definition 3.2 A periodic sequence ofw-bit integers is fc-dimensionally equidis- 
tributed to w-bit accuracy if the most significant v-bit integer sequence is k- 
dimensionally equidistributed. 

The dimension of equidistribution to w-bit accuracy k(v) is the maximum 
value of k such that the sequence is k-dimensionally equidistributed to v-bit 
accuracy. Larger values of k{v) show higher dimensional equidistribution for 
v-bit accuracy. 

For P = 2P — 1, there is a bound k(v) < lp/v\. The dimension defect d{v) 
at V is the difference d{v) := \p/v\ — k{v) > 0, the total dimension defect A is 
their sum over v: A := X]u=i d{v). Smaller A value shows that the generator is 
closer to the optimum from the view point of k{v) (v = 1, . . . ,w). 

A possible alternative to A is the maximum dimension defect max{d(w)|w = 
1,2,..., w}. In this article we use only A. 

To obtain a better equidistribution property, MT chooses a {w x w) matrix 
T (called the tempering matrix), and outputs x^T, x^+iT, Xi-|_2T, . . .. The tem- 
pering matrix T is chosen so that each k{v) (in particular for small v) is close 
to its upper bound mentioned above; namely, so that the dimension defect d(v) 
and their sum A is small. 

3.5 Design of MTGP : using threads 

In designing MTGP, we utilize the parallelism explained in ii3.ll namely N — M 
threads can work in parallel on the state space. We chose the number of threads 
for degree- A'^ MTGP generators as the largest power of 2 no more than N — 2 
(here M > 2 implies N - M < N - 2), which we denote by [A^ - 2J2 = 
2Liog2(w-2)J -y^g choose a small M (but M > 2), so that the gap iV - M > 
[iV - 2J2 holds. We released MTGP for three MEXPs p = 11213, 23209, 44497. 
These choices correspond to the number of threads being 256, 512, and 1024, 
respectively. 

3.6 Description of MTGP: recursion 

Let us fix an MEXP p. Similarly to MT (see tj3.3p . we compute N — [p/w] and 
r = wN — p. 

MTGP generates a sequence of elements in F^ by a recursion 

XN+i := g(xM+i,Xi+i,xf^''), (6) 

where M is an integer satisfying the condition in the previous section. The F2- 
linear recurring function g is determined by six integer parameters N, M, w, r, shl, sh2 
and a (4 X ui) matrix R. 

Definition 3.3 The MTGP recursion ^ is defined as follows, using temporary 
variables t and u ofw-bit integer (identified with row vectors in F^j; 

t ^ xi+, © (xpno'') 

t ^ t ® (t < shl) 
u ^ t © (xA/+, > sh2) 
XAr+, ^u® (u^R), (7) 



with the computation in this order. In the last line, u^ denotes the four- 
dimensional row vector over ¥2 consisting of the four LSBs of u, and hence 
the multiplication u^R with (4 x w)-matrix R yields a w- dimensional vector 
(see notations in ^3.3\) . 

The parameters {N, M, w, r, shl, sh2, R) are called the recursion parameters 
of the MTGP, and chosen to realize the period 2"'^^'" — 1. 

For the speed, the muhiphcation of R is implemented as a look-up table. Since 
u'* may assume only 16 values 0000, 0001, . . . , 1111 in the binary form, we may 
precompute 16 w-dimensional vectors (OOOO)R, (OOOl)R, . . ., (llll)R and store 
them in an array of words, say, in rectbl [0. . 15] . Then, u'*R is rectbl[u*]. 

An equivalent description to the recursion ([7]) in C-language is as follows. 
We assume that the content of the array rectbl is precomputed as above. Note 
that the symbol A denotes the bitwise exor in C. 

t = X[l+i] A (X[i] & BITMASK(w,r)); 
t = t A (t«shl); 
u = t A (X[M+i]»sh2); 
X [N+i] = u A rectbl[u & Oxf ]; (8) 

Here, BITMASK(w,r) is the bitmask of a w-hit integer with {w — r) MSBs being 
1 and r LSBs being 0. 

The cache of the texture memory of each block is suitable for such a look-up 
table, because it is free from bank-conflicts. Note that because the values of u'* 
in distinct threads (even in a warp) may often coincide, the execution of ([5]) may 
often cause a bank-conflict if the array rectbl is kept in the shared memory. 
If we keep them in registers, the number of available registers for each thread 
decreases, hence the number of concurrently executable threads decreases since 
there is a limitation in the total number of registers in one block. 

3.7 Description of MTGP : tempering 

Analogous to MT, MTGP transforms the sequence x^ by a fixed linear trans- 
formation to obtain better k(v) (called tempering): the {N + i)-t}i output ojv+j 
is given by 

On+i ■■= XTv+j © /l(xM-l+i) (9) 

for a suitably chosen linear transformation h so that the output sequence has a 
high value of k{v) (u = 1, . . . , w). 

Definition 3.4 The tempering (0) of MTGP is defined as follows, using a tem- 
porary variable t of w-hit integer: 

t ^ XM-i+i © {yiM-i+i » 16) 
t ^ t © (t > 8) 

ON+^^^N+^®t^T (10) 

with the computation in this order. Here, T is a (4 x w)-matrix over ¥2. The 
defining parameter is T, called the tempering parameter. The linear transfor- 
mation /i(xM-i+i) appearing in (0) is defined as t T appearing in UP]) . 
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Figure 2: Circuit-like description of MTGP: the right circuit is for recursion, 
the left circuit for the tempering. 



This type of tempering using another word xjvz-i+i in the state array came 

from [^ . Use of the multiplication by a (4 x w)-matrix was obtained through 

trial-and-error to obtain a fast tempering with a large value of k{v) {1 < v < 32). 

An equivalent description to the tempering (jlOp in C-language is as follows: 



(11) 



t = X[M-l+i] A (X[M-l+i] » 16); 
t = t A (t » 8); 
return (X[N+i] A tmptbl[t & Oxf]); 



Similarly to the look-up table rectbl in the recursion ([S]), tmptbl is an array 
of length 16 storing the values (OOOO)T, (OOOl)T, . . ., (llll)T. 

Altogether, one MTGP has the recursion parameters as in Definition 13.31 
and the tempering parameter as in Definition 13.41 A circuit-like description of 
MTGP is shown in Figure [H 



3.8 Dimensions of equidistribution of MTGP 

MTGP supports three different periods 2"2i3 _ i^ 223209 _ i and 2^4497 _ i. por 
each period, we searched for 128 different parameter sets with good equidistri- 
bution properties, sorted in terms of the total defect A. So users can generate 
128 distinct streams by MTGPs of different parameter sets. 

Table m lists k{v) and d{v) of the MTGP23209 of period 2^3209 _ i foj. ^j^e 
first parameter set among the obtained 128 sets. In addition, the defect ratio 
100 X d{v)/[p/v\ in % is listed. For u = 1, . . . , 16, d{v) is or 1, which means 
that the equidistribution up to 16-bit accuracy is close to the optimal. For 
comparison, Mersenne Twister MT19937 has d(l) = d{2) = d(4) == d(8) = 0, 
but d(3) = 405, d(5) = 249, d(6) = 207 and d{7) = 355 for 1 < u < 8. The 



Table 1: k{v), d{v) and the defect ratio r{v) 
of MTGP23209 with first parameter set 
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Total dimension defect A is 1141. 



total dimension defect A = 1141 of MTGP23209 is better than A = 6750 of 
MT19937. 

The WELL 221 generator has optimal A = 0, but it seems difficult to run 
WELL on GPUs efficiently because of the heavy dependencies among the partial 
computations in the recursion. 

The last (and worst) among the 128 parameter sets for MTGP23209 has 
A = 2100, still much better than MT19937. 



4 Comparison with other generators on GPU 

4.1 SDK-MT: a naive approach 

As mentioned earlier in ij3.1[ a more naive idea for PRNG for GPU is to use 
one PRNG for one thread, unlike for one block as in MTGP. An example is 
SDK Mersenne- Twister (SDK-MT) in CUDA-SDK [H] sample programs from 
NVIDIA, which is a set of 4096 (small) Mersenne Twister implementations on 
CUDA. SDK-MT uses 32 blocks, each block consists of 128 threads, and each 
thread runs a Mersenne Twister with 607-bit state space (MT607). Each of the 
32 X 128 = 4096 threads uses a distinct set of parameters, produced by Dynamic 
Creator for MT 18^ to support the independence of those streams. 

Figure [3] pictures one block (128 threads) for SDK-MT. These parameters 
(e.g., the coefficients in the recursion) are kept in the global memory. 
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Figure 3: SDK-MT: PRNGs for GPU, one thread for one generator. 
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4.2 Merits of MTGP over SDK-MT 

The parahelization in MTGP, namely using the parallehsm naturally appearing 
in the recursion in the array, has been known since the 1980s (see for example 
[D §5.2.1]). Its merits compared to SDK-MT are: 

• SDK-MT's consumption of memory counted in bits is (607-|-parameter size) x 
the number of threads. 

• MTGP's consumption is 32 x the number of threads. 

• As an illustrative example, assume that the size of the shared memory 
is 16Kbyte and that the state spaces of SDK-MT are allocated in the 
shared memory. Then, the number of parallel threads is small: namely 
(16Kbyte)/(size of working space for MT607)<400, losing its gain in par- 
allelism. 

• The period of generated sequence: SDK-MT has period 2^°^ — 1, while 
MTGP has period 2^^'^^^ — ! and higher dimensional equidistribution prop- 
erty. 



4.3 Other GPU-based PRNGs 

We compare MTGP with some other generators implemented for GPUs. We 
do not treat classical Linear Congruential Generators with modulus less than 
or equal to 2^^, such as the Park-Miller generator implemented for GPUs |11) . 
since they are not recommendable for massive use (its period is < 2'^^, and the 
sequence is rejected by several simple statistical tests called SmallCrush, see 
[H] and [IH]). We consider the following generators. 

• The MTGP generator with period of 2^^^^^ — 1 described above. We 
use 108 blocks to run 108 MTGPs whose characteristic polynomials are 
mutually distinct. 

• The SDK-MT generator with period of 2*5°^ - 1 described in Section |4TT1 



11 



Table 2: Consumed time for 5 x 10^ numbers by five PRNGs, their periods, 
and the results of BigCrush statistical tests: "pass*" means passed all the tests, 
"pass" means passed the tests except those on F2-linearity, "reject" means re- 
jected systematically by at least one of the tests not based on F2-linearity. 
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float[0,l) 


] 
32bit int 
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5.0ms 
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Period 


260V _ 1 


2ii2ia _ 2 


^2^^^ 


ouy/i _ 1 


^ 2^y^ 


BigCrush 


reject 


pass 


pass 


pass* 


reject 



• The HybridTaus generator ^Oj Example 37-4] in GPU Gems, with period 
of around 2^^^. This generator outputs the exor of two 32-bit integer 
streams, one generated by a combined Tausworthe generator taus88 [H] 
and the other by a linear congruential generator "Quick and Dirty." 

• Warp Generator [28]. This generator is based on a sparse F2 -linear tran- 



sition matrix M : ¥' 

■|p32x32 



32x32 



F: 



32x32 



Its period is 2 



1024 



1. The space 



is identified with an array of 32-bit integers of length 32, so that 
a warp computes the multiplication of M. The content of the array is 
considered as 32 of 32-bit random integers. The raw output (WarpCor- 
related) does not pass the BigCrush test suite, but by combining with a 
non F2-linear output function (sum of two 32-bit integers modulo 2^^), the 
output (WarpStandard) passes the BigCrush test suite. In this article, the 
Warp Generator means WarpStandard. 

The CURAND generator [H]. This is the newest in the NVIDIA SDK. 
This is an implementation of the xorwow generator, which is the com- 
bination of an F2-linear generator xorshift and a Weyl generator (i.e., a 
linear congruential generator with multiplier 1) by addition modulo 2^^, 
proposed in [TB]. Its period is 2^^^ — 2^^. 



4.4 Generation Speed 

Comparison of speed 

We measured the consumed time in milliseconds, for generating 5 x lO'' single 
floating-point numbers in the range [0,1) for MTGP11213, SDK-MT, Hybrid- 
Taus, WarpStandard, and CURAND, on the GeForce GT120 GPU (4 cores) 
and the GeForce GTX260 GPU (27 cores). For MTGP11213, we also mea- 
sured the time for generating unsigned integers and floating-point numbers in 
[1,2) (see gnifor the reason). Table [2] shows the results. On both GT12G and 
GTX260, SDK-MT is the slowest, and WarpStandard and CURAND are faster 
than MTGP. On GT120, CURAND is 2.4 times faster than MTGP, while on 
GTX260 the factor decreases to 1.7. 

Since CPU's and CPU's have inherently different purposes, there may be 
no fair way to compare their performance, but for a reference, we show that 
MTGP is faster than a PRNG on high-spec CPUs. SFMT [35], which is one 
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of the fastest generators on SIMD machines, takes 25nis to generate the same 
5x 10^ number of pseudorandom 32-bit integers on an Intel Xeon X5570 2.93GHz 
4 core 2 CPU using 1 process (i.e. one core in one CPU), while MTGP takes 
4.6ms as shown in the table. We also tried to use 4 cores of Xeon, but it turns 
out to be slower than using one core. This seems due to the overhead time for 
thread generation and synchronization (we use pthread in POSIX to generate 
four threads in the Xeon). 

We mention that there are studies on hardware implementations of Mersenne 
Twisters. For example, JSOj implemented MT19937 on both a FPGA hardware 
and on a GPU. They used the 8800 GTX GPU which is a little slower than the 
GTX260. Their implementation in the GPU takes 16.9ms to generate the same 
number of 32-bit integers as above, hence slower than MTGP on the GTX260. 
On the other hand, their FPGA implementation is 35 times faster than their 
GPU implementation and hence is much faster than MTGP on GTX260. 

4.5 Statistical tests 

We conducted TestUOl statistical test suits [H] to the MTGPs and the four 
other generators in the previous section. 

The Crush library in TestUOl contains 96 tests and the BigCrush library 
contains 106 tests. See the user's guide for details on the tests. The results of 
the tests show that MTGP, HybridTaus and Warp are not rejected, but SDK- 
MT and CURAND are systematically rejected, as explained below. 

MTGP Among 160 tests in the BigCrush library, four tests based on F2- 
linearity reject MTGPs. Two LinearComp tests (test number 80 and 81), 
which measure linear complexity of the sequence, reject all the MTGPs. 
Two MatrixRank tests (number 70 and 71), which measure the F2-linear 
rank of 5000x5000 matrices generated from the output bit stream, reject 
the MTGPs with a state size less than 5000 bits. This rejection is com- 
mon among F2-linear generators such as Mersenne Twister (see [TJ) and 
WELL generators [23]. Because these statistics are based on F2-linear 
relations of each bit of the outputs, the observed deviation would hardly 
cause problems when the sequence is used as real numbers or integers in 
a simulation. Other tests in the BigCrush library show no systematic de- 
viation of MTGPs (including those generated by MTGPDC explained in 
the next section). 

HybridTaus Similarly to MTGPs, HybridTaus passed the tests except Ma- 
trixRank tests and LinearComp tests in the BigCrush. 

SDK-MT SDK-MT is rejected by Random Walk tests (number 76) in addition 
to the above mentioned F2-linearity tests in the BigCrush. It passes the 
other tests. 

CURAND CURAND is systematically rejected by three tests in the BigCrush: 
CoUisionOver (number 7), SimpPoker (number 27) and LinearComp (num- 
ber 81). The p-values are outside the interval [10~-^^, 1 — 10^^^]. This 
phenomenon has been reported by [3]. Since the first test is on the ran- 
dom numbers in the interval [0, 1), the observed deviation may cause some 
erroneous results in a serious simulation. 
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The defect of CURAND becomes more serious when we examine the differ- 
ences between the successive outputs: namely, from the output sequence 
Xn € [0,1), we make a new sequence ?/„ := Xn — a;„_i mod 1 e [0,1). 
Among the BigCrush battery, three CollisionOver tests (number 7, 8, 10), 
one Gap test (number 36), one RandomWalk test (number 75), and one 
Run of Bits test (number 102) show p-values outside [10~^^, 1 — 10~^^] for 

Even among the Crush battery, two CollisionOver tests (number 7, 8), 
one ClosePairsN Jumps test (number 20), and one Gap test (number 32) 
show p- values outside [10^^,1 — 10^^] systematically. 

Weirp WarpStandard generator passes all the tests in the BigCrush library. 

5 MTGPDC 

Dynamic Creation of pseudorandom number generators [18: is proposed for large 
scale parallel simulations in which a number of PRNGs with distinct parameters 
are desired. Dynamic Creator for Mersenne Twister (MTDC) is released online: 
http: //www. math. sci .hiroshima-u. ac . jp/~m-mat/MT/DC/dc .html, 

MTDC receives a 16-bit integer called ID, and generates a parameter set for a 
Mersenne twister with specified (Mersenne prime) period, with the ID embedded 
in the recursion, so that distinct ID assures a distinct (hence relatively prime) 
irreducible characteristic polynomial of the transition function of the generator. 

One problem of MTDC is that there are some IDs where MTDC can not 
find a parameter set with the specified period. This phenomenon is observed 
only when w = 31. The ID is embedded as the 16 MSBs of the parameter a of 
MT (see ^3.3\i , and MTDC searches for the remaining bits of a that attain the 
maximal period, but sometimes no such bit-pattern exists. Another problem is 
that the 16-bit space for the ID is somewhat narrow for the recent trend of high 
parallelism. 

Our proposed MTGPDC receives a 32-bit integer as an ID, which is em- 
bedded in the (4 x 32) matrix parameter R in Definition 13.31 Although there 
is no assurance that distinct IDs give distinct characteristic polynomials, there 
is a heuristic argument that with high probability they would do so, since the 
32 bits are mapped to p (say, 11213) bits of the coefficients of the characteris- 
tic polynomial in a complicated way. For the sake of completeness, MTGPDC 
calculates the SHAl digest of the coefficients of the characteristic polynomial, 
so that users can check whether the characteristic polynomials are different (by 
checking that the SHAl digests are different) if they want. 

As pointed out by a referee, there is no theoretical assurance for the inde- 
pendence of the generated streams even if the characteristic polynomials are 
distinct. However as far as we know, there are no reports on the deviation of 
the correlations among two such F2-linear generators. We also tested a merged 
sequence from the outputs of two parameter sets of MTGPs of the same period 
using the BigCrush library, but no deviation is observed. 

As explained in Definitions 13.31 and 13.41 MTGP has two kinds of param- 
eter sets, namely recursion parameters and tempering parameters. Recursion 
parameters decide the recursion and hence the period of the generator. The 
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Figure 4: A recursion parameter R. For the efficiency of the search, the four sets 
of 4 LSBs in R are chosen so that the corresponding 4x4 matrix does not have 
an eigenvalue 1, which is equivalent to the invertibility of the transformation 
on u in the last line of ([7]). This is a necessary condition to have the maximal 
period. 



tempering parameter T changes the output function and decide the dimensions 
k{v) of equidistribution. 

MTGPDC first searches for the recursion parameter set {N, M, w, r, shl, sh2, R) 
in Definition 13.31 and then the tempering parameter set (Definition 13. 4p . Once 
the Mersenne prime period 2^ — 1 and the wordsize w are fixed, then TV = [p/w] 
and r = Nw — p are determined. MTGPDC embeds the received ID in the ma- 
trix R, as follows. 

Figure |3] describes the recursion parameter R, which consists of four 32- 
dimensional row vectors over F2. These vectors are identified with 32-bit inte- 
gers, with the MSB at the left-most component of the vector. The given 32-bit 
ID is separated into the 16 MSBs and the 16 LSBs. The former are embedded 
in the 16 MSBs of the first row. The latter are embedded in the 16 bits of the 
second row consisting of the 12th, 13th, . . ., and 28th MSB. 4 LSBs of each 
row are selected randomly so that the corresponding (4 x 4) matrix plus the 
identity matrix is invertible (this is a necessary condition for the last line in ([7]) 
being invertible, which is necessary for the irreducibility of the characteristic 
polynomial). Other parts of matrix R are randomly selected. 

Shift parameter shl and sh2 are fixed to 13 and 4 respectively, and the 
middle position M is randomly selected so that 2 < M < N — [N\2- Experi- 
mentally, we found that some shift parameters gave rather large defect A even 
after tempering, so we decided to fix these shift parameters which give good A 
empirically. For each selection of a parameter set, the characteristic polynomial 
is calculated using Berlekamp-Massey algorithm implemented in Number The- 
ory Library [57]. If the polynomial is reducible, the next possible parameter set 
(with ID still embedded in R) is randomly generated, until the irreducible poly- 
nomial is found. Since the degree of the irreducible polynomial is a Mersenne 
exponent, the polynomial is primitive and the maximal period 2^ — 1 is attained. 
According to our experiments, probably due to the large degree of freedom in 
R, MTGPDC finds a parameter set for any given ID, unlike the case of MTDC 
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Table 3: Time (sec) for recursion and tempering parameter search 
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379 
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3893 




average 


21.7 


34.1 


213.7 
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3236.4 



which fails to find one for some ID when w = 31. 

Once the recursion parameter set with maximal period is found, MTGPDC 
searches for a tempering parameter set. The tempering parameter is a 4 x 32 
matrix (realized by a look-up table tmptbl) which consists of four 32-bit vectors 
(see (fTO]) '). MTGPDC searches for tempering parameters as follows: 

• prepare an array tmp[0..3] of 32-bit integers. 

• set the four components of tmp[0..3] to 0. 

• search for the 23 MSBs of the parameters 
for i = to 3 do 

for j = to 20 step 5 do 

• e — min(j + 5, 23) 

• generate all bit patterns of j-th to e-th bits of tmp[i]. 

• calculate fc(l), fc(2), . . . fc(e) for each bit pattern. 

• fix the bit pattern that attains the least sum of the dimension defects 
d{l)+d{2) + --- + d{e). 

end for 
end for 

• modify the 9 LSBs of the parameters found above according to the distri- 
bution of 9 LSBs 

for i = to 3 do 

for j = to 9 step 5 do 

• e = min(j -|- 5, 9) 

• generate all bit patterns of 31-j-th to 31-e-th bits of tmp[i]. This does 
not change the 23 MSBs of the parameters. 

• calculate fc'(l) to fc'(e), where k'{v) means the dimension of equidistri- 
bution for the v LSBs. 

• fix the bit pattern that attains the least sum of the dimension defects 

d'(l) + (i'(2) H h d'{e), where d'{v) means the dimension defect for the 

V LSBs. 

end for 
end for 
Computing the dimension of equidistribution is a time-consuming part in 
MTGPDC. We used the SIS algorithm [7] to compute these dimensions, which 
reduces the computation time considerably. (Note that a faster method [6j has 
recently been developed.) 

Figure [S] shows the distribution of the CPU time for recursion-parameter 
searches for p = 11213. The minimum and maximum A among 1500 parameter 
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Figure 5: Distribution of time (sec) for recursion parameter search {p = 11213) 



sets are 565 and 3542, respectively. Table [3] shows the minimum, maximum and 
average time (sec.) to find one MTGP-parameter set for various p. The upper 
three rows show CPU times to find recursion-parameter sets and the lower three 
rows show CPU times to find tempering-parameter sets. Times in Figure [S] and 
Table [3] are measured on an Intel Xeon 5500 2.26GHz 4 core x 2 CPU, running 
15 MTGPDC processes simuhaneously. These timings show that MTGPDC 
is too slow to use during a simulation. MTGPDC should be used to create a 
sufficient number of parameter sets before the simulation, and these sets should 
be reused. 

MTGPDC records a set of recursion and tempering parameters in comma- 
separated values format in a file, together with the following additional data: 
SHAl digest of the characteristic polynomial, number of non-zero terms of the 
characteristic polynomial, and the total dimension defect. 

Detectable deviations in LSBs in an older version of MT- 
GPDC 

[221 reported the following problem on the former version of MTGPDC. In 
MTGPDC version 0.2 (released on June 8th 2009) for w — 32, several (say seven) 
LSBs of 32-bit outputs are often rejected by some tests in the BigCrush library 
for some parameter sets. Among these tests, two notable tests are sknuth_Gap 
test for 5-bit sequences consisting of the 26th, 27th, 28th, 29th, and 30th bit 
of 32-bit outputs (Test 35) and sstring_HammingIndep test for the same bits 
(Test 100). Test 35 rejects 1/5 of the 10000 created parameter sets, while 
Test 100 rejects 1/10. 

This older MTGPDC has only 23 MSBs for the tempering parameters. Af- 
ter this report, the remaining 9 LSBs in the tempering parameters are supple- 
mented, as described in this section. 
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6 Floating point generation in IEEE745 

If one generates random integer sequences and transforms them into uniform real 
numbers by multiplication or division, the conversion time is not negligible. For 
current NVIDIA CUDA-enabled GPUs, such conversion is not heavy (equivalent 
to 5 additions), but for present-day AMD GPUs, the integer-floating conversion 
is rather time-consuming. 

These days, most computers use IEEE 754 format for both single and double 
precision floating-point numbers. Generating floating points in this format is 
faster than conversion, see |26| . 

The same idea is used in MTGP. IEEE 754 single-precision format uses 32 
bits for representing a real number. It consists of a 1-bit sign part (the MSB), 
a 8-bit exponent part and a 23-bit fraction part. To generate single precision 
floating point numbers, the last line of the tempering (jlOp is changed to: 

Oj = (xi > 9) A sngltbl[t & OxOf]; (C) 

where sngltbl is essentially the same as tmptbl, just formatted to produce 
single floating point format in IEEE754, as explained below. Every component 
sngltbl [i] (0 < i < 15) has nine MSBs which are equal to 001111111, and its 
27 LSBs are the 27 MSBs of tmptbl [i] . The above ([C]) amounts to producing 
a 32-bit integer sequence whose nine MSBs are 001111111 and 27 LSBs are the 
27 MSBs of the 32-bit integer version of MTGP. The sign-bit and exponent- 
bit imply that the represented real number is in the range [1, 2), and the 27 
uniformly random LSBs imply the uniformity in that range. By subtracting 1.0, 
we obtain uniform distribution in [0, 1). It is often the case that we can use [1, 
2) directly, for example the Box-Muller transformation for Gaussian distribution 
(see 1261 §2]). 

7 Concatenating outputs of several generators 

Occasionally practioners ask us whether it is possible to generate a stream of a 
single generator (say, of MT19937) using many processors. It would be possible 
by using jumping-ahead. Here, "jumping-ahead by N steps" means to compute 
the state after generating TV outputs from the present state. If the state has 
p bits, then jumping-ahead costs 0{p^) operations. Even after some improve- 
ments, one jumping-ahead costs a few milliseconds for p = 19937 ((4"), which 
seems too costive. 

They worry that concatenating outputs of several sequence generators may 
result in a poorly random sequence, since the assurance of the dimension of 
equidistribution k(v) will be lost. 

In our experience, concatenating the output sequences of good pseudoran- 
dom number generators does not cause trouble. We think the dimension of 
equidistribution k{v) is often mis-interpreted. The k(y) is the dimension of 
uniformity for the whole period, and it gives no assurance for subsequences. 
However, it gives assurance on nonexistence of a linear relation: among the 
consecutive fc-tuples of v bits, there is no F2-linear relation which holds forever. 
This property is inherited by the subsequences, and is preserved when concate- 
nating outputs of several generators with the same property. In addition, in the 
cases of DC or MTGPDC, the generators have distinct irreducible characteristic 
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polynomials. This property does not necessarily assure the absence of correla- 
tions among these generators, but the risk of correlations would be smaller than 
the case where generators share one same recursion and have different seeds. 

8 Implementation using CUDA 

To use (the present version of) MTGP in a GPGPU program, one needs to write 
the code in CUDA [21]. 

A user needs to write both a host program (processed in the CPU) and a 
kernel program (processed in the GPU). A flow chart is in Figure [6l The left 
hand side is the host program. User application program (User AP) means the 
code which the user wants to run on GPGPU, using the pseudorandom numbers 
generated by MTGP. In the host program, initialization (or set up) of the GPU 
for the User AP kernel program is executed. Then, the initialization of the GPU 
for MTGP (such as texture memory set up for look-up tables, initialization of 
the state array in the global memory) is executed. Then, the host program 
calls the MTGP-kernel program (processed in the GPU) . In the kernel call, the 
host specifies (1) number of blocks and (2) number of threads in a block, and 
the following arguments are passed to the kernel program (or equivalcntly to 
each block) (3) the number of pseudorandom numbers to be generated for each 
thread (4) a pointer to global memory pointing to the state array (5) a pointer 
to global memory pointing to the area where the generated numbers are stored. 
When the GPU is called, each block reads the state from global memory to 
shared memory, then each thread generates pseudorandom numbers and writes 
them in the global memory. After writing a specified number of outputs, each 
block stores the present state back into the global memory, so that in the next 
MTGP kernel call the generation continues. To have a consecutive array of 
pseudorandom numbers, we need to wait for termination of all the blocks. This 
is done in the host program, using the synchronization function. Then, the 
User AP kernel program is called from the host. The User AP kernel program 
is executed on the GPU, using the array of generated pseudorandom numbers 
generated by MTGP. 

One drawback of this scheme is that the user needs to specify the number 
of pseudorandom numbers to be generated, before calling the User AP kernel. 

9 Conclusions 

We proposed MTGP pseudorandom number generators suitable for graphic pro- 
cessing units. The strategy to raise its efficiency is to apply many parallel 
threads to a single state space of a large pseudorandom number generator, to 
realize a large state space (and hence a long period and high-dimensional equidis- 
tribution) without loss of its generation speed. The state space is allocated in 
the shared memory in the GPU, and its parallelism of memory access is suitable 
for the design of the banks of the memory in the GPU. The generation speed of 
MTGP is comparable to some of the fastest generators for the GPU. For exam- 
ple, CURAND generator is faster than MTGP by a factor of 1.7, but statistical 
tests show some weaknesses of CURAND. On the other hand, WarpStandard is 
also faster than MTGP and passes the BigCrush statistical test suites. ^From 
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Figure 6: Flow chart for MTGP host, kernel programs and application program. 
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the viewpoint of the period and high-dimensional equidistribution, MTGPs have 
better assurance than other generators. 

We also designed a parameter generator for MTGP, named MTGPDC. It 
runs on CPUs, receives the period and 32-bit ID, searches for a recursion param- 
eter set with the ID embedded, and then searches for a tempering parameter set 
to attain high-dimensional equidistribution. This facility fits a GPGPU with 
many nodes. The source code of MTGP and MTGPDC, together with their 64- 



bit variants, are available from the url http : //www . math . sci . hiroshima-u . ac . jp/~in-mat/MT/MTGP/ 
MTGPs passed the statistical tests in the BigCrush library, except for those 
tests which measure F2-linear dependency of the output sequence. The failure 
in such tests is common to the Mersenne Twister and the WELL, which would 
not cause a problem in a usual stochastic simulation. 
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